
Releasing Stale Terraform State Locks in AWS DynamoDB After Pipeline Failures

Last Tuesday at 2 a.m., I got the kind of notification that makes a DevOps engineer sick to their stomach: our CI/CD pipeline’s Terraform apply had timed out. Jenkins sent the runner a SIGTERM, which killed the job and left the state file locked. Every subsequent pipeline run hit a dead end with the familiar message: Error acquiring the state lock.

I have stared at that screen more times than I can count. After working through the problem several times (once I rushed it and made things worse), here is what works, what does not, and how to avoid the issue in the future.

Quick Summary

  • How the S3 + DynamoDB combination manages state locking
  • Why a cancelled pipeline still leaves orphaned locks
  • How to run terraform force-unlock safely after a failed run
  • Script and CI runner changes that prevent future stale locks

Understanding the Terraform Backend S3 Lock Timeout

When you use an S3 backend with a DynamoDB table for state locking, Terraform does not simply fetch the state file and start applying against the cloud provider. The first thing it does is write a lock item for that state to the DynamoDB table. If a lock item already exists when another terraform plan or apply starts, the new run fails immediately by default (-lock-timeout defaults to 0s); pipelines that set a lock timeout instead wait that long before giving up with the S3 backend lock timeout error.
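For reference, here is the flag in play; the 5m value below is a common pipeline choice, not a Terraform default:

# Wait up to five minutes for an existing lock to clear before failing.
terraform plan -lock-timeout=5m
terraform apply -lock-timeout=5m -auto-approve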

S3 and DynamoDB State Locking Architecture

The backend block should look like this:

terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}

The above configuration tells Terraform to save state in an S3 bucket and to use a DynamoDB table called terraform-state-locks for locking. The Terraform documentation for the S3 backend states that this lock is acquired before any command that modifies the state. If the process holding the lock dies abruptly (say, the runner is killed), the lock remains until someone deletes it; there is no lease or heartbeat mechanism.
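For reference, a lock table like this is typically created with LockID as its only key. A minimal sketch, assuming the table name from the backend block and on-demand billing:

# Create the lock table Terraform expects: a single string partition
# key named LockID (table name and billing mode are your choice).
aws dynamodb create-table \
  --table-name terraform-state-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST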

Recognizing the Stale Lock Symptoms

You may see a message like the following in your pipeline’s output. The key part is the ID of the lock being held.

Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
  Path:      prod/network/terraform.tfstate
  Operation: OperationTypeApply
  Who:       i-1234567890abcdef0@jenkins-runner
  Version:   1.5.7
  Created:   2024-10-25 06:12:34.567891 +0000 UTC
  Info:      

Terraform acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.

This is the stale-lock situation that terraform force-unlock exists for. Since the lock will never expire on its own, you will need to delete it.

Root Causes of the “Error Acquiring the State Lock” in DynamoDB

Cancelled CI/CD Jobs and Dropped Connections

Most CI/CD systems send SIGTERM to a job that exceeds its timeout or is cancelled by a user, and typically follow up with SIGKILL after a grace period. If Terraform is killed before it can release the lock, it exits abruptly and the DynamoDB lock item is left in place.

Network failures cause the same problem. If a client loses its connection partway through an apply, the lock remains in place because the client never gets to send the delete request for it.

Analyzing the DynamoDB Lock ID Format Terraform Generates

If you look at the JSON of the lock item in your DynamoDB lock table, you will see the following format.

{
  "LockID": {
    "S": "my-org-terraform-state/prod/network/terraform.tfstate"
  },
  "Info": {
    "S": "{\"ID\":\"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111\",\"Operation\":\"OperationTypeApply\",\"Who\":\"i-1234567890abcdef0@jenkins-runner\",\"Version\":\"1.5.7\",\"Created\":\"2024-10-25T06:12:34.567891Z\"}"
  }
}

The LockID Terraform generates is the name of your S3 bucket joined with the state file key (bucket/key). The table also holds a companion item whose LockID ends in -md5; it stores a digest of the state file rather than the lock itself. If you’re going to use the CLI, keep the exact LockID string of the lock item handy, as you’ll need it later.
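If you would rather inspect the table than trust the error output, a quick scan lists every item; this assumes the table name from the backend block above:

# List all lock (and digest) items currently in the table.
aws dynamodb scan \
  --table-name terraform-state-locks \
  --query 'Items[].LockID.S'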

Insufficient IAM Permissions for DynamoDB Operations

If the runner is unable to read from or write to the DynamoDB table, the cause may not be a stale lock at all. This happens when the IAM role lacks the required permissions, specifically dynamodb:GetItem, dynamodb:PutItem, and dynamodb:DeleteItem on the lock table.

The resulting message still says “Error acquiring the state lock,” which makes it misleading.
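A minimal policy statement granting those three actions might look like the following sketch; the region, account ID, and table name are placeholders for your environment:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateLocking",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/terraform-state-locks"
    }
  ]
}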

What Didn’t Work for Me

At first, I found a stuck lock in the AWS console and, without checking whether anyone else was applying, deleted the lock item. I found out later that a coworker was running terraform destroy against that same state. The destroy completed only partially, and the state file was corrupted. I spent several hours restoring from a versioned S3 backup.

So, my golden rule: Never remove a lock unless you are absolutely certain that no valid process holds it.

Fixing the Terraform Force-Unlock DynamoDB State Lock Error

Step 1: Verify the Lock Owner and Status

From the directory where the failing command ran, run terraform plan (or simply terraform state list) to reproduce the error output, which includes the Who field shown above. Check whether that Who value (often an instance ID and runner name) still corresponds to a running job. If it refers to a dead job, you can proceed.
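You can also read the lock item straight from DynamoDB without rerunning Terraform; this sketch assumes the table name and LockID from the earlier examples:

# Print the Info blob, which includes the Who and Created fields.
aws dynamodb get-item \
  --table-name terraform-state-locks \
  --key '{"LockID": {"S": "my-org-terraform-state/prod/network/terraform.tfstate"}}' \
  --query 'Item.Info.S' --output text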

Step 2: Execute the Force-Unlock Command

Copy the Lock ID from the error message. Then run terraform force-unlock [Lock ID].

terraform force-unlock a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
Do you really want to force-unlock?
Terraform will remove the lock on the remote state.
This is a dangerous operation and should only be performed
when you are certain no one else is running Terraform
against this state. Only 'yes' will be accepted to confirm.

Enter a value: yes

Terraform state lock has been successfully force-unlocked.

Under the hood, force-unlock deletes the lock item from DynamoDB through the same backend code that created it, which keeps the table tidy and leaves no orphaned metadata behind.

Step 3: Validate State File Integrity in S3

After unlocking the state, make sure that it has not been corrupted by running the following command and verifying that it produces an exit code of 0:

terraform state pull > /dev/null

An exit code of 0 means the state pulled and parsed cleanly. If the command fails, restore from a versioned S3 backup as soon as possible.
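If you do have to roll back, S3 versioning on the state bucket is what saves you. A sketch using the bucket and key from the backend block above:

# List prior versions of the state object.
aws s3api list-object-versions \
  --bucket my-org-terraform-state \
  --prefix prod/network/terraform.tfstate \
  --query 'Versions[].{VersionId: VersionId, LastModified: LastModified}'

# Download a known-good version for inspection before restoring it.
aws s3api get-object \
  --bucket my-org-terraform-state \
  --key prod/network/terraform.tfstate \
  --version-id PASTE_VERSION_ID_HERE \
  terraform.tfstate.backup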

Edge Case: Manually Unlocking the State When force-unlock Fails

Sometimes terraform force-unlock itself fails: the backend configuration changed, the state file moved to a different location, or the schema of the DynamoDB table your CI/CD pipeline uses was altered. In those cases, bypass the CLI and remove the lock manually.

Bypassing the CLI via the AWS Management Console

Go to the AWS Management Console and open the DynamoDB service. Navigate to Tables > terraform-state-locks (or your table name) > Items tab. Use Scan or Query to find the item whose LockID matches the lock you are trying to release, and select it.

After finding the correct lock, click Actions > Delete and confirm the deletion.

Deleting the Lock Item Directly from the DynamoDB Table

If you prefer the CLI, the DynamoDB DeleteItem API identifies the lock by its primary key. The following command deletes the lock from your table.

aws dynamodb delete-item \
  --table-name terraform-state-locks \
  --key '{"LockID": {"S": "my-org-terraform-state/prod/network/terraform.tfstate"}}'

Replace the LockID value with the exact string for your lock. The GUID from the Info blob is not needed; the item is deleted by its primary key. Leave the companion -md5 digest item alone unless you are restoring an older state version (in which case Terraform may prompt you to update its Digest value). The lock is removed immediately; afterwards, verify the state is intact as in Step 3.
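For an extra guardrail against racing a live run, a DynamoDB condition expression can make the delete succeed only while the Info blob still contains the exact lock ID you verified; this sketch reuses the names and IDs from the examples above:

# Delete the lock only if it is still the one we inspected;
# #i aliases the Info attribute name.
aws dynamodb delete-item \
  --table-name terraform-state-locks \
  --key '{"LockID": {"S": "my-org-terraform-state/prod/network/terraform.tfstate"}}' \
  --condition-expression 'contains(#i, :id)' \
  --expression-attribute-names '{"#i": "Info"}' \
  --expression-attribute-values '{":id": {"S": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111"}}'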

Prevention: How to Handle SIGTERM Terraform Pipeline Terminations

Configuring Graceful Shutdowns in CI/CD Runners

In GitHub Actions you can trap the SIGTERM signal inside a step’s shell and run cleanup before the runner hard-kills the job. Below is an example of how to do this in a job:

jobs:
  terraform:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Terraform Apply with trap
        run: |
          cleanup() {
            echo "Received signal, releasing the state lock..."
            # LOCK_ID must be captured beforehand; see the note below.
            [ -n "${LOCK_ID:-}" ] && terraform force-unlock -force "$LOCK_ID" || true
          }
          trap cleanup SIGTERM SIGINT
          # Run apply in the background so the shell can handle the
          # signal while apply is still running, then wait on it.
          terraform apply -auto-approve &
          wait $!

You still need to capture the lock ID at the start or use a wrapper that knows how to find it, since Terraform does not print your own lock ID during a healthy run. With that in place, a cancelled job fires the trap and releases the lock.

How to Resolve Concurrent Pipeline Runs Terraform Lock Conflicts

When more than one pipeline triggers against the same state, you get lock contention. To avoid these conflicts, use a workflow concurrency group in GitHub Actions or resource_group in GitLab CI, as sketched below. Jobs then queue instead of overlapping, so they never compete for the lock.
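A sketch for GitHub Actions; the group name is arbitrary, so scope it to the state you are protecting:

concurrency:
  # One workflow at a time per state file; queued runs wait their turn.
  group: terraform-prod-network
  cancel-in-progress: false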

Implementing Wrapper Scripts for Safe State Releases

A simple bash wrapper that checks lock ownership and then force-unlocks on exit keeps the pipeline clean; the table name and state key below mirror the backend block from earlier:

#!/bin/bash
set -euo pipefail
TABLE="terraform-state-locks"
KEY="my-org-terraform-state/prod/network/terraform.tfstate"
cleanup() {
  # Release the lock only if this host is still the one holding it.
  INFO=$(aws dynamodb get-item --table-name "$TABLE" \
    --key "{\"LockID\":{\"S\":\"$KEY\"}}" --query 'Item.Info.S' --output text 2>/dev/null || true)
  [[ "$INFO" == *"$(hostname)"* ]] || return 0
  terraform force-unlock -force "$(echo "$INFO" | grep -oE '"ID":"[0-9a-fA-F-]+"' | cut -d'"' -f4)" || true
}
trap cleanup EXIT SIGTERM SIGINT
terraform apply -auto-approve

This way, even if Terraform is interrupted mid-apply, the EXIT trap still attempts to release the lock, and the ownership check keeps it from deleting someone else’s. It is not foolproof (a SIGKILL skips traps entirely), but it covers most cases.

Frequently Asked Questions

Is it safe to force-unlock a Terraform state file?

Only if you are absolutely certain that no other Terraform process is writing to the state at the same time. Removing the lock while another run is mid-apply will corrupt the state. Always check the Who field on the lock and verify with your teammates.

How long does a Terraform DynamoDB lock last if left alone?

Forever. The lock items carry no expiry unless you configure DynamoDB’s TimeToLive on the table yourself, which is risky because TTL deletions can lag well behind the expiry time and could remove a live lock. The lock remains until someone deletes it.

Can I disable state locking for my S3 backend entirely?

Yes. You can pass -lock=false on individual commands or omit dynamodb_table from the backend configuration. But that is reckless the moment more than one developer or pipeline can touch the same state; sooner or later you will corrupt the state file and earn a long, sleepless night.

So the next time a lock error stops your CI pipeline: verify the owner of the lock, force-unlock to release it, and harden the shutdown traps in your pipeline so the problem eventually stops recurring. Then, sleep.
